Project description:
For our first project, we're going to take a look at SAT and ACT scores around the United States. Suppose that the College Board - the organization that administers the SAT - seeks to improve the participation rate of its exams. Your presentation should be geared toward non-technical executives with the College Board and you will use, at minimum, the provided data to make recommendations about how the College Board might work to increase the participation rates of these exams.
# Let's get all our helper libraries loaded first
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
sns.set_style("whitegrid", {'axes.grid' : False})
# %config InlineBackend.figure_format = 'retina'
%matplotlib inline
We load the SAT and ACT data into two dataframes. We also load in state-by-state enrollment data scraped from "governing.com":governing.com.
# Read the data into DataFrames
sat = pd.read_csv("../data/sat.csv")
act = pd.read_csv("../data/act.csv")
# Enrollment data was scraped from
# http://www.governing.com/topics/education/gov-fragmented-school-districts-consolidation.html
enroll = pd.read_csv("../data/enrollment.csv")
sat.head(3)
act.head(3)
For the SAT:
For the ACT:
# act.info
print("There are {} NaNs in the dataframe.".format(act.isnull().sum().sum()))
act.describe()
sat.describe()
print("There are {} NaNs in the dataframe.".format(act.isnull().sum().sum()))
The data looks complete as there are no missing (null) entries.
However:
We will investigate these 'abnormal' values a little further along in the notebook.
# Check consistency of data
# sat_sections = ["Math", "Evidence-Based Reading and Writing"]
# act_sections = ["Math", "English", "Reading", "Science"]
# SAT sections should add up to SAT total
check_sat = abs(sat.Math + sat["Evidence-Based Reading and Writing"] - sat.Total)
for ix,val in check_sat.iteritems():
if val > 1:
print("State {} ({}): math+ebrw-Total = {}".format(ix, sat.loc[ix,'State'], val))
# ACT sections should add up to add up to 4 times the ACT Composite
check_act = abs(act.Math + act.English + act.Reading + act.Science - act.Composite*4)
for ix,val in check_act.iteritems():
if val > 1:
print("State {} ({}): math + english + reading + science -4*Composite = {}"
.format(ix, act.loc[ix,'State'], val))
The scores for Maryland are inconsistent. We can change the incorrect scores to ensure the data is consistent..
# Clean data by hand
sat.loc[20, "Math"] = sat.Total[20] - sat.loc[20, "Evidence-Based Reading and Writing"]
act.loc[21, "Science"] = act.Composite[21]*4 - act.Reading[21] - act.English[21] - act.Math[21]
# ACT column types
print("ACT column types:")
act.dtypes
# SAT column types
print("SAT column types:")
sat.dtypes
We can do some cleaning:
Ideas for improvement:
# Remove the unamed columns from both and the national from act
sat.drop('Unnamed: 0', axis=1, inplace=True)
act.drop('Unnamed: 0', axis=1, inplace=True)
# Only keep columns we need
enroll = enroll[['State','Enroll']]
# Change the row index to be state
sat.set_index("State", inplace=True)
act.set_index("State", inplace=True)
enroll.set_index("State", inplace=True)
# Save the national row just in case
act_national = act.loc["National",:]
act.drop('National', axis=0, inplace=True)
# Change participation of SAT, ACT to be a float
sat['Participation'] = sat['Participation'].map(lambda cell: cell.replace('%',''))
act['Participation'] = act['Participation'].map(lambda cell: cell.replace('%',''))
sat['Participation'] = sat['Participation'].astype(float)
act['Participation'] = act['Participation'].astype(float)
enroll['Enroll'] = enroll['Enroll'].map(lambda cell: cell.replace(',',''))
enroll['Enroll'] = enroll['Enroll'].astype(float)
# Approximation of number of seniors
enroll['Enroll'] = enroll['Enroll']/13
# Merge the data on the state column
df = pd.merge(sat, act, on='State')
df = pd.merge(df,enroll, on='State')
# Rename the columns
df = df.rename(columns={
'Participation_x': 'Participation_SAT',
'Evidence-Based Reading and Writing' : "EBRW_SAT",
'Math_x' : 'Math_SAT',
'Total' : 'Total_SAT',
'Math_y' : 'Math_ACT',
'English' : 'English_ACT',
'Reading' : 'Reading_ACT',
'Science' : 'Science_ACT',
'Composite' : 'Composite_ACT',
'Participation_y': 'Participation_ACT'
})
# Approximate number students taking test
df['Participation_SAT_count'] = df.Enroll * df.Participation_SAT /100
df['Participation_ACT_count'] = df.Enroll * df.Participation_ACT /100
for c in df.columns:
print("The min and max of column {} is {} and {}.".format(c,df[c].min(), df[c].max()))
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$
# Function to calculate standard deviation
def my_std(values):
mu = np.mean(values)
return np.sqrt(sum([(v-mu)**2 for v in values])/len(values))
sd = df.apply(lambda x: my_std(x))
# Sort DataFrame
df = df.sort_values(by="Participation_SAT", ascending=False)
df.head()
# Sort DataFrame
df = df.sort_values(by="Participation_ACT", ascending=False)
# df.head(18)
df.head()
# Sort DataFrame
columns = ["Enroll", "Participation_SAT", "Participation_ACT", "Participation_SAT_count", "Participation_ACT_count"]
df_enroll = pd.DataFrame(df, columns=columns)
df_enroll = df_enroll.sort_values(by="Enroll", ascending=False)
df_enroll.head(15)
print("Approximate participation counts:")
print("ACT: {}".format(int(df.Participation_ACT_count.sum())))
print("SAT: {}".format(int(df.Participation_SAT_count.sum())))
df[df["Participation_SAT"] > 70]
sat_color = "#32b165"
act_color = "#a38cf4"
reg_color = '#f67088'
reg2_color = '#96a331'
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=2, figsize=(15, 5))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)
ax=axes[0]
ax.set_title('SAT participation rates')
sns.distplot(df.Participation_SAT, bins=10, color=sat_color,kde=False, ax=ax)
#, hist_kws={"color": current_palette})
ax.set_xlabel('Participation Rates')
ax.set_ylabel('Number of states');
ax=axes[1]
ax.set_title('ACT participation rates')
sns.distplot(df.Participation_ACT, bins=10, color=act_color, kde=False, ax=ax)
#, hist_kws={"color": current_palette})
ax.set_xlabel('Participation Rates')
ax.set_ylabel('Number of states');
# Set up the matplotlib figure
f, axes = plt.subplots(2,2 , figsize=(15, 10))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)
ax=axes[0,0]
ax.set_title('Math SAT scores')
sns.distplot(df.Math_SAT, bins=10, color=sat_color,kde=False, ax=ax)
ax.set_xlabel('Math Scores');
ax=axes[0,1]
ax.set_title('Math ACT scores')
sns.distplot(df.Math_ACT, bins=10, color=act_color,kde=False, ax=ax)
ax.set_xlabel('Math Scores');
ax=axes[1,0]
ax.set_title('Science ACT scores')
sns.distplot(df.Science_ACT, bins=10, color=act_color,kde=False, ax=ax)
ax.set_xlabel('Science Scores');
ax=axes[1,1]
ax.set_visible(False)
# Set up the matplotlib figure
f, axes = plt.subplots(2,2, figsize=(15, 10))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)
ax=axes[0,0]
ax.set_title('EBRW SAT scores');
sns.distplot(df.EBRW_SAT, bins=20, color=sat_color, kde=False, ax=ax);
ax.set_xlabel('EBRW SAT Scores');
ax=axes[0,1]
ax.set_title('English ACT scores')
sns.distplot(df.English_ACT, bins=10, color=act_color,kde=False, ax=ax)
ax.set_xlabel('English ACT Scores');
ax=axes[1,0]
ax.set_title('Reading ACT scores');
sns.distplot(df.Reading_ACT, bins=20, color=act_color, kde=False, ax=ax);
ax.set_xlabel('Reading ACT scores');
ax=axes[1,1]
ax.set_visible(False)
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=2, figsize=(15, 5))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)
ax=axes[0]
ax.set_title('Total SAT scores')
sns.distplot(df.Total_SAT, bins=10, color=sat_color, kde=False, ax=ax);
ax.set_xlabel('Total Scores');
ax=axes[1]
ax.set_title('Composite ACT scores');
sns.distplot(df.Composite_ACT, bins=10, color=act_color, kde=False, ax=ax);
ax.set_xlabel('Composite Scores');
The most common assumption is that the data is normally distributed. This somesn't seem to hold true for any of our columns.
# Set up the matplotlib figure
f, axes = plt.subplots(3,2, figsize=(15, 15))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)
ax=axes[0,0]
ax.set_title("Participation rates by state");
sns.regplot(x="Participation_SAT", y="Participation_ACT", data=df, color=reg_color, ax=ax)
ax.set_xlabel("SAT")
ax.set_ylabel("ACT")
ax=axes[0,1]
ax.set_title("Participation counts by state");
sns.regplot(x="Participation_SAT_count", y="Participation_ACT_count", data=df, color=reg2_color, ax=ax);
ax.set_xlabel("SAT")
ax.set_ylabel("ACT")
ax=axes[1,0]
ax.set_title("English score comparison for SAT vs ACT");
ax.scatter(df.EBRW_SAT, df.English_ACT, color=reg_color)
ax.scatter(df.EBRW_SAT, df.Reading_ACT, color=reg2_color)
ax.set_xlabel("SAT EBRW")
ax.set_ylabel("ACT English and Reading")
ax.legend()
ax=axes[1,1]
ax.set_title("Math score comparison for each state");
ax.scatter(df.Math_SAT, df.Math_ACT, color=reg_color)
ax.scatter(df.Math_SAT, df.Science_ACT, color=reg2_color)
ax.set_xlabel("SAT Math")
ax.set_ylabel("ACT Math and ScienceT");
ax.legend()
ax=axes[2,0]
ax.set_title("Overall score comparison by state");
ax.scatter(df.Total_SAT, df.Composite_ACT, color=reg_color)
ax.set_xlabel("Total SAT")
ax.set_ylabel("Composite ACT")
ax.legend()
ax=axes[2,1]
ax.set_visible(False)
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=2, figsize=(15, 5))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)
ax=axes[0]
ax.set_title('SAT score comparisions')
ax.scatter(df.Total_SAT, df.EBRW_SAT, color=reg_color);
ax.scatter(df.Total_SAT, df.Math_SAT, color=reg2_color)
ax.set_xlabel('Total SAT');
ax.set_ylabel("Section scores")
ax.legend()
ax=axes[1]
ax.set_title('ACT score comparision');
ax.scatter(df.Composite_ACT, df.English_ACT, color=sat_color);
ax.scatter(df.Composite_ACT, df.Math_ACT, color=act_color)
ax.scatter(df.Composite_ACT, df.Science_ACT, color=reg_color)
ax.scatter(df.Composite_ACT, df.Reading_ACT, color=reg2_color)
ax.set_xlabel('Composite ACT');
ax.legend();
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=2, figsize=(15, 5))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)
ax=axes[0]
ax.set_title('SAT total score to participation rates');
sns.regplot(x="Participation_SAT", y="Total_SAT", data=df, color=sat_color, scatter=True, ax=ax)
ax.set_xlabel('Participation rate')
ax.set_ylabel("Total SAT scores")
ax=axes[1]
ax.set_title("ACT composite to participation rates")
sns.regplot(x="Participation_ACT", y="Composite_ACT", data=df, color=act_color, scatter=True, ax=ax)
ax.set_xlabel("Participation rate")
ax.set_ylabel("Composite ACT scores");
# While fun, this doens't add much to our observations
# sns.pairplot(df);
Additionaly, the scores on the sections of the SAT are positively correlated with the total score (not surprising).
While I added approximate enrollment data, I did not have time to fully explore relationships between enrollment and the other data.
ACT_scores = ["Math_ACT", "English_ACT", "Reading_ACT", "Science_ACT", "Composite_ACT"]
SAT_scores = ["Math_SAT", "EBRW_SAT"]
participation_rates = ["Participation_SAT", "Participation_ACT"]
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=2, figsize=(15, 5))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)
ax=axes[0]
ax.set_title("SAT Scores")
df.boxplot(SAT_scores, ax=ax)
ax=axes[1]
ax.set_title("ACT Scores")
df.boxplot(ACT_scores, ax=ax);
# Compare participation rates
plt.title("participation_rates")
df.boxplot(participation_rates);
# Modified from example code from plotly
import plotly
import plotly.plotly as py
import pandas as pd
plotly.tools.set_credentials_file(username='gbkgwyneth', api_key='NS01qFBPZLMjc5Kb6AEa')
scl = [[0.0, 'rgb(147, 208, 191)'],[0.2, 'rgb(114, 191, 196)'],[0.4, 'rgb(82, 174, 201)'],\
[0.6, 'rgb(64, 149, 181)'],[0.8, 'rgb(59, 116, 138)'],[1.0, 'rgb(55, 83, 94)']]
# From Stack Overflow
# https://stackoverflow.com/questions/48979352/choropleth-map-in-python-using-plotly-without-state-codes
state_codes = {
'District of Columbia' : 'DC','Mississippi': 'MS', 'Oklahoma': 'OK',
'Delaware': 'DE', 'Minnesota': 'MN', 'Illinois': 'IL', 'Arkansas': 'AR',
'New Mexico': 'NM', 'Indiana': 'IN', 'Maryland': 'MD', 'Louisiana': 'LA',
'Idaho': 'ID', 'Wyoming': 'WY', 'Tennessee': 'TN', 'Arizona': 'AZ',
'Iowa': 'IA', 'Michigan': 'MI', 'Kansas': 'KS', 'Utah': 'UT',
'Virginia': 'VA', 'Oregon': 'OR', 'Connecticut': 'CT', 'Montana': 'MT',
'California': 'CA', 'Massachusetts': 'MA', 'West Virginia': 'WV',
'South Carolina': 'SC', 'New Hampshire': 'NH', 'Wisconsin': 'WI',
'Vermont': 'VT', 'Georgia': 'GA', 'North Dakota': 'ND',
'Pennsylvania': 'PA', 'Florida': 'FL', 'Alaska': 'AK', 'Kentucky': 'KY',
'Hawaii': 'HI', 'Nebraska': 'NE', 'Missouri': 'MO', 'Ohio': 'OH',
'Alabama': 'AL', 'Rhode Island': 'RI', 'South Dakota': 'SD',
'Colorado': 'CO', 'New Jersey': 'NJ', 'Washington': 'WA',
'North Carolina': 'NC', 'New York': 'NY', 'Texas': 'TX',
'Nevada': 'NV', 'Maine': 'ME'}
state_df = pd.DataFrame.from_dict(state_codes, orient='index')
state_df.rename(columns={ 0:'Code'}, inplace=True)
df_heat = pd.concat([df,state_df], axis=1)
for col in df_heat.columns:
df_heat[col] = df_heat[col].astype(str)
def heatmap(df, col, title, scl):
data = [ dict(
type='choropleth',
colorscale = scl,
autocolorscale = False,
locations = df['Code'],
z = df[col].astype(float),
locationmode = 'USA-states',
text = df['Code'],
marker = dict(
line = dict (
color = 'rgb(255,255,255)',
width = 2
) ),
colorbar = dict(
title = "Rate")
) ]
layout = dict(
title = title,
geo = dict(
scope='usa',
projection=dict( type='albers usa' ),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)'),
)
fig = dict( data=data, layout=layout )
return fig
fig_sat = heatmap(df_heat,"Participation_SAT", 'SAT Participation Rate by State', scl)
py.iplot( fig_sat, filename='d3-cloropleth-map' )
fig_act = heatmap(df_heat,"Participation_ACT", 'ACT Participation Rate by State', scl)
py.iplot( fig_act, filename='d3-cloropleth-map' )
fig_sat_count = heatmap(df_heat,"Participation_SAT_count", 'SAT Participation Count by State', scl)
py.iplot( fig_sat_count, filename='d3-cloropleth-map' )
fig_act_count = heatmap(df_heat,"Participation_ACT_count", 'ACT Participation Count by State', scl)
py.iplot( fig_act_count, filename='d3-cloropleth-map' )
fig_enroll = heatmap(df_heat,"Enroll", 'Enrollment by State', scl)
py.iplot( fig_enroll, filename='d3-cloropleth-map' )
New Hampshire and Vermot did not provide data, so they are colored with black.
# The center, the shape, and the spread.
mu=pd.Series(df.mean(), name="Mean")
std = pd.Series(df.std(), name="Std Dev")
pd.concat([mu,std], axis=1)
The distributions of the scores and participation rates are not normal and not close to uniform. I would like to have more data.
There is an inverse relationship between participation rates of the two tests. Also, average section scores on the tests are related to the average overall scores by state.
t_stat, p_value = stats.ttest_ind(df.Participation_ACT,df.Participation_SAT)
print("The t-statistic is {}.".format(t_stat))
print("The p-value is {}.".format(p_value))
# Plot the confidence interval
# Taken from lecture notebook
# Generate points on the x axis between -4 and 4:
xpoints = np.linspace(-4, 4, 500)
# Use `stats.t.pdf` to get values on the probability density function for the t-distribution.
# The second argument is the degrees of freedom: n1 + n2 - 2.
ypoints = stats.t.pdf(xpoints,
len(df.Participation_ACT)+len(df.Participation_SAT) - 2,
0, 1)
# Initialize a `matplotlib` "figure."
fig = plt.figure(figsize=(8,5))
# Get the current "axis" out of the figure.
ax = fig.gca()
# Plot the lines using `matplotlib`'s plot function:
ax.plot(xpoints, ypoints, linewidth=3, color=reg2_color)
# Plot a vertical line for our measured difference in rates' t-statistic.
ax.axvline(t_stat, color=reg_color, linestyle='--', lw=5)
ax.axvline(-t_stat, color=reg_color, linestyle='--', lw=5);
Since the p_vale is small, we can reject the null hypothesis here. There is a correlation between the participation rates in the ACT and the SAT. In states where ACT participation is high, SAT participation is low.
n = df.shape[0]
participation_act_mean = df['Participation_ACT'].mean()
participation_act_std_error = df['Participation_ACT'].std()/np.sqrt(n)
interval_act = stats.norm.interval(0.95, participation_act_mean, participation_act_std_error)
print("The 95% confidence interval for the participation of the ACT is [{},{}].".format(interval_act[0],interval_act[1]))
participation_sat_mean = df['Participation_SAT'].mean()
participation_sat_std_error = df['Participation_SAT'].std()/np.sqrt(n)
interval_sat = stats.norm.interval(0.95, participation_sat_mean, participation_sat_std_error)
print("The 95% confidence interval for the participation rate of the SAT is [{},{}].".format(interval_sat[0],interval_sat[1]))
This is not good data to use since the populations of the states vary so greatly. Without including the enrollment data in the analysis, the test here lacks much meaning.
I'm not sure what the confidence intervals for the state participation rates for the SAT and ACT reflect, beacuse the data is not normalized by population sizes.
np.correlate(df.Participation_ACT, df.Participation_SAT)
Did you say correlations? Let's make a heat map!
# Heat map of colmns
cols = ['Enroll','Participation_SAT','Participation_SAT_count','Participation_ACT','Participation_ACT_count','Total_SAT','Composite_ACT',]
colors = sns.color_palette("GnBu_d", 8)
plt.figure(figsize=(10,10))
sns.set_palette(colors)
# sns.heatmap(df_train[top_corr].corr(),vmin=-1,vmax=1, cmap = colors);
sns.heatmap(df[cols].corr(),vmin=-1,vmax=1, cmap = colors, annot=True);
Maybe for states that have similar participation rates for both tests it would be appropriate to generate the correlation. But since states with high rates of participation have lower scores in general, it is probably not appropriate to compare in states where the participation rates are very different.
The samples are not random, so I think that statistical inference is not appropriate.
# Setting up colors for the notebook
sns.palplot(sns.color_palette("husl", 8))
colors = ['#f67088', '#ce8f31', '#96a331', '#32b165', '#35aca4', '#38a7d0', '#a38cf4', '#f461dd']